ABSTRACT

Objective We investigated the common-disease relevant 
information obtained from sequencing compared with 
that reported from genotyping arrays. 
Materials and methods Using 187 publicly available 
individual human genomes, we constructed genomic 
disease risk summaries based on 55 common diseases 
with reported gene-disease associations in the research
literature using two different risk models, one based on 
the product of likelihood ratios and the other on the 
allelic variant with the maximum associated disease risk. 
We also constructed risk profiles based on the single 
nucleotide polymorphisms (SNPs) of these individuals 
that could be measured or imputed from two common 
genotyping array platforms. 
Results We show that the model risk predictions 
derived from sequencing differ substantially from those 
obtained from the SNPs measured on commercially 
available genotyping arrays for several different 
non-monogenic diseases, although high density 
genotyping arrays give identical results for many 
diseases. 

Conclusions Our approach may be used to compare the 
ability of different platforms to probe known genetic risks 
disease by disease. 
OBJECTIVE

Access to a patient's genome offers amazing 
possibilities in clinical medicine. Researchers are 
racing to develop the biomedical knowledge and 
tools capable of turning the mass of individual 
sequence data into clinically actionable findings. 
Recent work has contrasted different direct-to-consumer
large-scale genotyping services using
single nucleotide polymorphism (SNP) arrays, 1 and
many individuals are receiving health related 
information through this route. At the same time, 
whole genome sequencing has been used to directly 
implicate particular genetic variants in disease, 2 3 
while others have reported on methods for clinical 
use of a full human genome sequence. 4 5 As the 
price of large-scale sequencing continues to drop, it 
is worthwhile to determine if the findings of whole 
genome sequencing substantially differ from those 
available through commercially available genotyping
arrays and to what extent the results
depend on the diseases being investigated.

Large-scale sequencing provides more genetic 
information than even the highest density arrays, 
and the cost difference between technologies is easy 
to calculate. However, determining the relative 
benefits for any individual of a genetic-based test 
for disease prognosis is substantially more difficult. 
For any particular disease of interest, disease
prediction relies on the interplay of many factors
including disease prevalence, the predictive power 
of disease associated variants, the frequency of the 
disease associated genetic variations (eg, allele 
frequency), the history of environmental exposures 
that may alter disease risk, the existence or absence 
of treatments for the disease, and many other 
factors. We cannot address all of these issues; 
however, we can provide some core insight into the 
differences in information provided by separate 
technologies. We can compare risk models derived 
from genotyping arrays with those obtained from 
more complete sequencing. If genotyping arrays 
and sequencing provide substantially different risk 
predictions, then sequencing has an information 
advantage over genotyping arrays at the level of 
individual diseases. If the reported risks are not 
substantially different, then full sequencing 
may not currently provide much useful clinical 
information over genotyping arrays. This is a relatively
trivial exercise for monogenic (Mendelian)
diseases: either a particular technology measures 
the variant with accuracy or it does not. However, 
complex diseases, with many different loci 
associated with disease risk, are the topic of this 
paper. 

Because many different SNPs at different locations
in the genome have been associated with
separate diseases and each is associated with
different variations in disease risk, it is important to
consider how to combine the results of genotyping
at multiple disease associated loci for each disease/condition.
We use two different models for integrating
multiple disease associated genetic variants.
We have previously suggested using likelihood
ratios, modeling each measured disease-variant
association as an independent test for the disease
and using the product of the likelihood ratios to combine
results, as described in greater detail in previous
publications. 4 5 This has some advantage over other
approaches that model independent risk
contributions: when ORs are used to
approximate risk ratios and large numbers
of variants are combined, such approaches can report
extremely large, nonsensical values. However, this is
not the only model of genetic risk of disease.
Instead of looking at each genetically independent 
variant measured as an independent test for disease, 
we can assume that a single variant establishes 
most of the genetic risk in each individual, but that 
the particular genetic locus driving risk varies from 
individual to individual. This model uses the 
maximum likelihood ratio for the variants measured. 
This is analogous to the idea of a physical chain 
being only as strong as its weakest link; in our 
model the genetic risk is approximated by the risk 
conferred by the single allele with the largest effect size, that is,
the most damaging allele. The variant with the single largest
likelihood ratio is assumed to confer all the risk, and any genetic
contribution from any other variants is ignored. This is biased
toward the larger effect sizes and more deleterious alleles by
definition. Of course, neither of these models should be taken as
a perfect predictor of genetic risk; however, they provide parsimonious
models. They make no assumptions about epistatic
interactions that have not yet been validated. Importantly, they
do not incorporate any other clinical covariate that may affect
the likelihood of a particular disease diagnosis, such as environmental
or demographic features, although one advantage of
these approaches, which are both fundamentally Bayesian in
formulation, is that additional diagnostic features, particularly
when they are independent of the genetic association with
disease, may be easily integrated into a more complex version of
the model. 7 8
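
As a minimal illustration of the Bayesian formulation mentioned above, the sketch below applies the odds form of Bayes' rule, in which the post-test odds equal the pre-test odds multiplied by a likelihood ratio. The function name and the numbers are hypothetical and are not taken from the study.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Odds form of Bayes' rule: post-test odds = pre-test odds * likelihood ratio."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# Hypothetical example: a 10% pre-test probability and a combined genetic
# likelihood ratio of 2.0.
example_probability = post_test_probability(0.10, 2.0)
```

Because independent tests multiply in this form, a further independent diagnostic feature would enter simply as one more likelihood ratio in the product.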

Only a handful of individuals have been fully sequenced in 
great depth. However, the 1000 Genomes Project 9 provides 
greater sequence data with better informed imputation across 
a number of individuals of different ethnic backgrounds using 
a variety of sequencing technologies at different research 
centers. We can use the preliminary release of these data as 
a source of sequences to generate clinical risk reports based on 
our two models, the product of likelihood ratios and maximum 
likelihood ratio. At the same time, we can also use the subset of 
variants reported from common genotyping arrays, either 
through direct measurement or imputed from the SNPs 
measured in arrays, and compare them with clinical reports 
from the fuller sequencing data. For our analysis, we focus 
only on two commercially available genotyping platforms. It is 
certainly possible to design a custom array around specific 
variants of interest, including rare variants of clinical interest, 
or to laboriously employ targeted use of PCR to examine all 
the relevant variants desired, a strategy followed by many of 
the companies carrying out personal genomic testing. 10 This 
has the advantage of providing a clinical profile based on 
variants vetted through expert curation (eg, Hsu et al), 11 but 
has the disadvantage of some loss of flexibility if there are 
differences in opinion concerning which variants to include in 
the analysis. 

We have found that even if a large amount of data can be 
imputed from a very high-density genotyping array, the clinical 
risk assessment using both the product of the likelihood ratios 
model and the maximum likelihood ratio differs substantially 
between the results obtained using two common genotyping 
arrays (Illumina Omni and Affymetrix 500k) and the fuller 
sequence data from 1000 Genomes for a number of diseases. This 
suggests that clinical interpretation of the results of genotyping
using an 'off-the-shelf' array is likely to lack important information
relevant to a patient's health.

METHODS 

As explained in Ashley et al and their associated supplementary
materials, we have compiled an extensive database of published
genetic associations with disease from over 2800 research
publications. We filter out any reported association that is not
significant: in a candidate gene study the p value must be <0.05
and in a genome-wide study, the p value must be <0.00001.
Reported associations can also be filtered out for a variety of 
quality control reasons, including the fact that many do not 
report the actual risk associated allele, or do not report enough 
information to calculate a likelihood ratio. 
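
The significance filter described above can be sketched as follows. The two thresholds come from the text; the record layout, field names, and example records are hypothetical, and only one of the quality control criteria (a reported risk allele) is modeled.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds stated in the text above.
CANDIDATE_GENE_P_THRESHOLD = 0.05
GENOME_WIDE_P_THRESHOLD = 0.00001


@dataclass
class Association:
    """One published SNP-disease association (field names are hypothetical)."""
    snp_id: str
    disease: str
    p_value: float
    genome_wide: bool            # True for a genome-wide study, False for a candidate gene study
    risk_allele: Optional[str]   # None when the publication does not report it


def passes_filters(assoc: Association) -> bool:
    """Keep significant associations that also report the risk allele."""
    threshold = GENOME_WIDE_P_THRESHOLD if assoc.genome_wide else CANDIDATE_GENE_P_THRESHOLD
    return assoc.p_value < threshold and assoc.risk_allele is not None


# Hypothetical records, for illustration only.
records = [
    Association("rs_example_1", "Asthma", 0.01, genome_wide=False, risk_allele="A"),
    Association("rs_example_2", "Asthma", 0.01, genome_wide=True, risk_allele="G"),
    Association("rs_example_3", "Asthma", 1e-8, genome_wide=True, risk_allele=None),
]
kept = [r.snp_id for r in records if passes_filters(r)]
```

Here only the first record survives: the second misses the genome-wide threshold and the third lacks a risk allele.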



To best satisfy the independence assumptions needed in our
approach, we merge results reported from multiple SNPs for
the same condition that are in strong linkage disequilibrium
with one another in the same haploblock, as defined by the
CEU population in HapMap, and then take the most significant.
This could introduce some bias toward larger effect sizes,
but it is necessary to combine studies that use slightly
different genotyping technology and are indirectly measuring
the same variation, as we do not want to double count closely
linked SNPs. Here we operate under the assumption that the
most significant of a set of linked SNPs is more closely associated
with some underlying, perhaps directly causal, variant.
This assumption is common in many studies, which report the most
strongly associated variant when a group is linked with a disease in
a genome-wide association study. For this study we have
focused on genetic associations for 55 important diseases, as
many of the conditions in our database are either not at the
right level of specificity (eg, genetic risk associations for
cancers in general), not particularly clinically relevant (eg, eye
color), or related to medication response, something which
can be dealt with more comprehensively by experts in
pharmacogenomics.
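
The merging step can be illustrated with a short sketch that keeps, for each disease and haploblock, only the most significant reported SNP. The dictionary keys, block labels, and p values are hypothetical; assigning SNPs to haploblocks from HapMap CEU data is assumed to have been done beforehand.

```python
def most_significant_per_haploblock(associations):
    """Keep, for each (disease, haploblock) pair, the association with the smallest p value.

    `associations` is a list of dicts with the hypothetical keys
    'disease', 'haploblock', 'snp_id', and 'p_value'.
    """
    best = {}
    for assoc in associations:
        key = (assoc["disease"], assoc["haploblock"])
        if key not in best or assoc["p_value"] < best[key]["p_value"]:
            best[key] = assoc
    return list(best.values())


# Hypothetical input: two linked SNPs in one haploblock and one SNP in another.
reported = [
    {"disease": "Asthma", "haploblock": "block_1", "snp_id": "rs_a", "p_value": 1e-6},
    {"disease": "Asthma", "haploblock": "block_1", "snp_id": "rs_b", "p_value": 1e-9},
    {"disease": "Asthma", "haploblock": "block_2", "snp_id": "rs_c", "p_value": 1e-7},
]
merged_ids = sorted(a["snp_id"] for a in most_significant_per_haploblock(reported))
```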

The 187 genomes were taken from the 1000 Genomes Project 
pilot 1 and pilot 2 studies, 9 downloaded on July 1, 2010. These 
genomes represent a variety of ethnic groups and sequencing 
technologies, and include related family members, all factors 
that can influence the results of any comparative analysis. At the 
same time, although these are extensive genetic sequences, they 
have not been analyzed to the depth reported for other individuals. 12 13
However, this is a unique resource on sequencing
data for individuals, and although genetic association studies are 
biased toward particular populations (eg, CEU), we will use this 
as the standard of comparison. 

The 187 sequences we have collected are derived from 
a variety of high throughput sequencing technologies and 
centers. 9 Although we do not have access to basic genotyping 
array studies performed on these individuals, we can infer the 
results from a theoretical array from the more complete 
sequence information. We can examine all the SNPs measured by 
the array platform of interest and use the sequence calls from 
the human sequence as the reported genotypes from a theoretical
genotyping array. This approach also automatically excludes
any experimental differences where the genotyping array does
not match the sequence data. We have chosen to compare the
set of variants measured and potentially imputable by the very
commonly used Affymetrix GeneChip Human Mapping 500K
Array Set (Affymetrix 500k), which measures approximately
500 000 SNPs, and the very high coverage Human Omni1-Quad
BeadChip (Illumina Omni), which measures over one million
SNPs.

Genotyping arrays are frequently used for more than the variants
that they directly measure, as variations are often in strong
linkage disequilibrium with one another and one can be accurately
imputed from measuring another. To increase the coverage
of our theoretical genotyping arrays, we also use the reported
sequence genotype for any variant which can be imputed from
those measured on the array platform with >75% accuracy
using the state-of-the-art MACH imputation software with CEU
phasing data in the HapMap as the reference. 14 This is not ideal
for non-Caucasian individuals, but our analysis is not meant to
provide perfect predictions, just illustrative results.
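
The construction of a theoretical array from sequence calls can be sketched as a simple restriction of one individual's genotypes to the SNPs a platform measures or can impute. The function name, SNP identifiers, and genotypes below are hypothetical, and the set of imputable SNPs is assumed to have been determined separately.

```python
def theoretical_array_genotypes(sequence_calls, measured_snps, imputable_snps=()):
    """Return the sequence genotypes restricted to SNPs an array could report."""
    reportable = set(measured_snps) | set(imputable_snps)
    return {snp: genotype for snp, genotype in sequence_calls.items() if snp in reportable}


# Hypothetical data for one individual.
sequence_calls = {"rs_a": "AG", "rs_b": "TT", "rs_c": "CC"}
measured_only = theoretical_array_genotypes(sequence_calls, measured_snps={"rs_a"})
with_imputation = theoretical_array_genotypes(
    sequence_calls, measured_snps={"rs_a"}, imputable_snps={"rs_b"}
)
```

Because the array genotypes are copied from the sequence calls, any difference in downstream risk estimates reflects coverage alone, not measurement disagreement.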

To model potential interactions between disease associated 
alleles, we view each associated variant (per haploblocks, as 
above) as providing an independent genetic test for the 
associated disease. For each variant we compute a likelihood
ratio:

\[
\text{Likelihood Ratio}
= \frac{\text{Probability of genotype in diseased person}}{\text{Probability of genotype in nondiseased person}}
= \frac{\mathrm{TP}}{\mathrm{FP}}
= \frac{P(G \mid D)}{P(G \mid \neg D)}
\]

(equation 1)

This enables the creation of a flexible likelihood ratio for any
conceivable genotype combination of homozygote or heterozygote
alleles, in contrast with many risk models which compare
only two sets of genotypes, grouping heterozygotes with one of
the homozygote genotypes. The genetic variants are free from
direct genetic linkage, as we look only at SNPs in distinct and
independent haploblocks and then assume that they are independent
tests for disease. We can then model the overall likelihood
ratio of disease simply as the product of likelihood ratios
from each individual disease associated variant to provide the
product model for likelihood ratios, depicted graphically in
previous work as a nomogram. 7 Genetic measurements at variants
that are not associated with increased or decreased likelihood
of disease are assumed to be uninformative tests, with
a likelihood ratio of unity, and are therefore excluded.
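
A minimal sketch of the product model follows, working with sums of natural logarithms so that the result is directly comparable with log likelihood ratios. The lookup table layout, SNP identifiers, and probabilities are hypothetical values chosen for illustration.

```python
import math


def genotype_likelihood_ratio(p_genotype_given_disease, p_genotype_given_no_disease):
    """Likelihood ratio for one genotype: P(G|D) / P(G|not D), as in equation 1."""
    return p_genotype_given_disease / p_genotype_given_no_disease


def product_log_likelihood_ratio(genotypes, lr_table):
    """Natural log of the product of per-variant likelihood ratios.

    Variants with no reported likelihood ratio are treated as uninformative
    (likelihood ratio 1.0, contributing 0 to the sum of logs).
    """
    return sum(math.log(lr_table.get((snp, genotype), 1.0))
               for snp, genotype in genotypes.items())


# Hypothetical values, one variant per independent haploblock.
lr_table = {
    ("rs_a", "AG"): genotype_likelihood_ratio(0.30, 0.20),  # 1.5
    ("rs_b", "TT"): genotype_likelihood_ratio(0.40, 0.50),  # 0.8
}
genotypes = {"rs_a": "AG", "rs_b": "TT", "rs_c": "CC"}  # rs_c has no reported association
combined_llr = product_log_likelihood_ratio(genotypes, lr_table)
combined_lr = math.exp(combined_llr)  # about 1.2
```

In this example a risk genotype and a protective genotype partly cancel, which is the attenuation that the product model allows.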

In our alternative maximum likelihood ratio model, we 
assume that the significant contribution to disease risk arises 
from only the variant with the largest effect size. A likelihood 
ratio is computed as described above in equation 1 for each 
genotype at each relevant locus for an individual and the single 
largest association with disease is used, the maximum likelihood 
ratio. This is roughly proportional, although not exactly equivalent,
to taking the maximum OR or RR.
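
The maximum likelihood ratio model can be sketched in the same style. This sketch assumes that only genotypes with a reported likelihood ratio take part in the maximum and that an individual with no such genotypes receives 1.0 (no information); the table layout and values are hypothetical.

```python
def max_likelihood_ratio(genotypes, lr_table):
    """Largest reported likelihood ratio among an individual's genotypes.

    Assumption: genotypes without a reported likelihood ratio are ignored,
    and 1.0 is returned when none of the genotypes has one.
    """
    reported = [lr_table[(snp, genotype)]
                for snp, genotype in genotypes.items()
                if (snp, genotype) in lr_table]
    return max(reported, default=1.0)


# Hypothetical likelihood ratios.
lr_table = {("rs_a", "AG"): 1.5, ("rs_b", "TT"): 0.8}

risk_and_protective = max_likelihood_ratio({"rs_a": "AG", "rs_b": "TT"}, lr_table)
protective_only = max_likelihood_ratio({"rs_b": "TT", "rs_c": "CC"}, lr_table)
no_information = max_likelihood_ratio({"rs_c": "CC"}, lr_table)
```

Under this assumption the result can fall below 1.0 when every informative genotype is protective.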

RESULTS 

To compare genotyping array based results with the sequence 
derived likelihood ratios, we compute the differences in the 
natural logarithm of likelihood ratio for each of the 55 diseases, 
for each of the 187 individuals. The results of these comparisons 
are shown in figures 1 and 2. The boxplots show the difference 
between the risk profiles derived from the genotyping arrays and 
the risk profiles derived from the genome sequence for each 
disease for each of the 187 individuals. Wider boxes correspond 
to greater variation in the difference, and the median difference 
is indicated by a vertical bar. To allow comparison of the amount of
genetic information available for each disease, the number of
SNPs contributing to the risk profile is also shown.
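
The per-disease quantity summarized by each boxplot can be sketched as follows, assuming the sign convention given in the figure 1 caption (the array-based value is subtracted from the sequence-derived value). The log likelihood ratios are hypothetical, and the quartile method is the default of Python's `statistics` module, which may differ from the one used for the published figures.

```python
import statistics


def log_lr_differences(llr_sequence, llr_array):
    """Per-individual difference: sequence-derived minus array-derived log LR."""
    return [seq - arr for seq, arr in zip(llr_sequence, llr_array)]


# Hypothetical log likelihood ratios for five individuals and one disease.
llr_sequence = [0.9, -0.2, 0.4, 1.1, 0.0]
llr_array = [0.5, -0.2, 0.1, 0.3, 0.0]

differences = log_lr_differences(llr_sequence, llr_array)
median_difference = statistics.median(differences)
quartiles = statistics.quantiles(differences, n=4)  # three cut points: Q1, median, Q3
```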

As is apparent from the lists of numbers in the figures, the 
number of SNPs used in each model is not always a simple 
whole number; instead, an average number of usable SNPs 
across individuals is reported. This fractional number can occur 
for several reasons including a lack of accurate reporting for all 
possible genotypes from the genetic association studies used, or 
a missing or ambiguous sequence call at a key SNP. In such cases,
the variant is treated as an uninformative or inconclusive test for disease
association, and a likelihood ratio of 1.0 is assumed for that
particular variant-disease association.
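
The fractional SNP counts can be reproduced with a small sketch that averages, across individuals, the number of SNPs with a usable likelihood ratio. The data layout, SNP identifiers, and values are hypothetical.

```python
def mean_usable_snps(genotypes_by_individual, lr_table):
    """Average, across individuals, of the number of SNPs with a usable likelihood ratio.

    A SNP is usable for an individual when the sequence call is present and the
    (SNP, genotype) pair has a reported likelihood ratio; otherwise it is treated
    as an uninformative test (likelihood ratio 1.0) and not counted.
    """
    counts = [
        sum(1 for snp, genotype in genotypes.items()
            if genotype is not None and (snp, genotype) in lr_table)
        for genotypes in genotypes_by_individual
    ]
    return sum(counts) / len(counts)


# Hypothetical likelihood ratios and calls; None marks a missing sequence call.
lr_table = {("rs_a", "AG"): 1.3, ("rs_a", "GG"): 1.7, ("rs_b", "TT"): 0.8}
genotypes_by_individual = [
    {"rs_a": "AG", "rs_b": "TT"},   # both usable
    {"rs_a": None, "rs_b": "TT"},   # missing call at rs_a
]
average_usable = mean_usable_snps(genotypes_by_individual, lr_table)
```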

The root mean square (RMS) difference in the log likelihood 
ratios within a disease averaged across the 55 diseases is shown 
in table 1. This gives a quantitative, overall summary of variability
between platforms and shows the relative contribution of
imputation. These 55 diseases were chosen to represent a wide 
range of conditions, but the 187 individuals are not entirely 
representative of any particular population. 



Some potential biases are observable in the figures. In the
model that is derived from the maximum (reported) likelihood
ratio, there is a strong skew toward a result suggesting increased
risk compared with the product of likelihood ratios model, which
combines risk and protective alleles, attenuating the derived
likelihood ratio. In other words, the median tends to be higher in
the max model compared with the product model. However, as
can be seen in the plots on the right of each figure, this is not
universally the case, and it may seem counter-intuitive that the
model using the maximum likelihood ratio can show consistently
lower reported risk relative to the product of likelihood ratios for
a disease. This is because a genotyping array may not measure
any of the SNPs associated with that disease and thus all individuals
will have a likelihood ratio of 1.0, or no information.
However, most or all of the individuals sequenced may actually
have the protective allele, and thus when full sequence information
is available, the risk derived from the maximum likelihood
ratio may actually be less. This may be anticipated as an artifact
of our patient population, as we know that the subjects who 
have been sequenced are biased toward older individuals, free 
from disease, and perhaps selectively biased toward possessors of 
alleles that have protected them from disease into middle age. In 
general, meta-analyses, such as our synthesis of published 
association studies, are subject to issues in reporting bias toward 
positive findings over negative ones and toward over-inflated 
effect size estimates. 15 16 

Although the absolute differences in the clinical profile are 
important, a key parameter is the relative variability in disease 
likelihood between that reported based on the genotyping arrays 
and that derived from the more extensive sequence data. In 
tables 2 and 3 we show these relative differences (equation 2) for 
each of the 55 diseases, for the SNPs derived from the genotyping
arrays, as explained for figures 1 and 2. For each disease,
we show the RMS of the difference between the log likelihood ratio
derived from the genotyping array SNPs and that derived from the
sequence data, divided by
the RMS of the log likelihood ratio derived from the variants in
the sequence data (equation 1). For each disease, d, and each
patient, p, in the set of patients P, we compare the RMS
difference in the log likelihood ratio derived from sequence data,
LLRseq, and that derived from the variants measured or imputable
from the genotyping array, LLRgen.



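\[
\frac{\sqrt{\sum_{p \in P} \left( LLRseq_{d,p} - LLRgen_{d,p} \right)^{2}}}
     {\sqrt{\sum_{p \in P} \left( LLRseq_{d,p} \right)^{2}}}
\]

(equation 2)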



A value of > 1.0 in tables 2 and 3 indicates that the variability 
in the clinical profile for that disease as reported from the SNPs 
on the genotyping array relative to that from sequencing is 
greater than the variability between individuals. We can see that 
for many diseases, the variability between the results provided 
by sequencing and genotyping SNPs is a large fraction of the 
variability between individuals, and depending on the model and 
genotyping array, may actually exceed the variability between 
individuals. 
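
The relative difference described above (the RMS of the difference over the RMS of the sequence-derived log likelihood ratio) can be sketched as below. The shared factor of 1/|P| cancels, so plain sums of squares are used. The handling of an all-zero denominator is an assumption of this sketch, and the input values are hypothetical.

```python
import math


def relative_rms_difference(llr_seq, llr_gen):
    """RMS of (LLRseq - LLRgen) over patients, divided by the RMS of LLRseq."""
    if len(llr_seq) != len(llr_gen):
        raise ValueError("both inputs must cover the same patients")
    numerator = math.sqrt(sum((s - g) ** 2 for s, g in zip(llr_seq, llr_gen)))
    denominator = math.sqrt(sum(s ** 2 for s in llr_seq))
    if denominator == 0.0:
        # Assumption: no sequence-derived signal; report 0.0 only if the array agrees.
        return 0.0 if numerator == 0.0 else math.inf
    return numerator / denominator


# Hypothetical log likelihood ratios for three patients and one disease.
llr_seq = [0.5, -1.0, 2.0]
identical = relative_rms_difference(llr_seq, llr_seq)          # array matches sequence
uninformative = relative_rms_difference(llr_seq, [0.0, 0.0, 0.0])  # array reports LR = 1 for all
```

An array that reproduces the sequence-derived values gives 0.00, while an array that measures none of the associated SNPs (log likelihood ratio 0 for everyone) gives exactly 1.00.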

When using a richer model that allows many individual 
variants to contribute to adjustments in the risk, in this case the 
product model, the medically relevant results from the genotyping
arrays differ substantially from those derived from
sequencing across many diseases, even with the very high 
coverage of the Illumina Omni platform and a liberal allowance 
for imputation. For diseases like Alzheimer's and type II diabetes 
[Figure 1 plot. Panels: Product of LRs (left) and Max of LRs (right). Horizontal axis in both panels: log likelihood ratio difference. Rows, one per disease: Abdominal aortic aneurysm; Acute lymphoblastic leukemia; Acute myeloid leukemia; Adenocarcinoma; Adolescent idiopathic scoliosis; Age related macular degeneration; Alcohol dependence; Allergic rhinitis; Alzheimer's disease; Amyotrophic lateral sclerosis; Ankylosing spondylitis; Anorexia; Asthma; Atrial fibrillation; Autoimmune thyroid disease; Biliary tract cancer; Bipolar disorder; Bladder cancer; Bronchopulmonary dysplasia; Cardiac hypertrophy; Cardiovascular disease; Cataract; Celiac disease; Cerebral malaria; Chronic lymphocytic leukemia; Chronic obstructive pulmonary disease; Colorectal cancer; Crohn's disease; Depression; Eczema; Epilepsy; Esophageal cancer; Gastric cancer; Glioblastoma; Graves' disease; Hodgkin lymphoma; Hypertension; Hypertrophic cardiomyopathy; Migraine; Multiple myeloma; Multiple sclerosis; Obesity; Osteoarthritis; Parkinson's disease; Psoriasis; Rheumatoid arthritis; Schizophrenia; Stroke; Systemic lupus erythematosus; Thyroid cancer; Tourette syndrome; Type 1 diabetes; Type 2 diabetes; Ulcerative colitis; Venous thrombosis.]

Figure 1 Difference in likelihood ratios (LRs) derived from the genome sequence and genotyping array reportable variants from the Affymetrix 500k 
platform. Differences in natural logarithm of likelihood ratios derived for 55 diseases for the single nucleotide polymorphisms (SNPs) measured on the 
Affymetrix 500k platform, compared with the likelihood ratios derived from the SNPs in the genome sequence for the 187 genomes examined. For each 
disease, the log likelihood ratio reported by the array based SNPs is subtracted from the sequence derived value for each individual. The boxplot shows 
the median as a dark bar, with the box width showing the center quartiles, and the whiskers showing the outer quartiles; extreme outliers are 
excluded. The results using the product of likelihood ratios for all associated variants are depicted in the plot on the left, while the results from the
maximum likelihood ratio are shown in the plot on the right; note that the horizontal scales are different. To the left of each disease bar are two 
numbers. The lighter colored, left number indicates the number of SNPs used in the likelihood ratio derived from the full sequence, while the darker 
number to the right indicates the number of SNPs used in the likelihood ratios derived from the SNPs directly measured by the Affymetrix 500k 
platform. 



with many associated variants not covered by the genotyping 
array, the overall likelihood ratios can vary dramatically by as 
much as a factor of 20 (shown as a difference in logs of three in 
figures 1 and 2). 
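
The correspondence between a difference of three in natural logarithms and a factor of about 20 can be checked directly; the variable names below are illustrative only.

```python
import math

log_difference = 3.0                    # difference in natural-log likelihood ratios
fold_change = math.exp(log_difference)  # ratio of the two likelihood ratios, just over 20
```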

DISCUSSION 

There are some limitations to our work. We focus only on the 
disease associations published in the literature from findings that
have been reported as statistically significant, from studies
primarily relying on relatively common genetic variants.
Undoubtedly, many genetic variants that contribute to disease
risk remain to be discovered, and these will help explain the
heritability of many common diseases. 17 18 However, our results
demonstrate that whole genome sequencing shows markedly
different results when only the currently known gene-disease
associations are studied, and before consideration of the even
[Figure 2 plot. Panels: Product of LRs and Max of LRs. Rows: the same 55 diseases as in figure 1.]

Figure 2 Difference in likelihood ratios (LRs) derived from the genome sequence and genotyping array reportable variants from the Illumina Omni
platform. Differences in log likelihood ratios derived for 55 diseases from the SNPs measured on the Illumina Omni platform, compared with the
likelihood ratios derived from the SNPs in the full genome sequence for the 187 genomes examined. The results are presented as in figure 1.



Table 1 Root mean squared (RMS) difference in log likelihood ratios
(LRs) between those derived from the full sequences and those
derivable from genotyping arrays, averaged across diseases

Simulated platform  SNP calling             Risk model      Averaged RMS log likelihood difference
Affymetrix 500k     Measured                Product of LRs  0.895
Affymetrix 500k     Measured                Maximum of LRs  0.384
Affymetrix 500k     Measured and imputable  Product of LRs  0.514
Affymetrix 500k     Measured and imputable  Maximum of LRs  0.204
Illumina Omni       Measured                Product of LRs  0.519
Illumina Omni       Measured                Maximum of LRs  0.203
Illumina Omni       Measured and imputable  Product of LRs  0.292
Illumina Omni       Measured and imputable  Maximum of LRs  0.102

SNP, single nucleotide polymorphism.



greater advantage provided by examining unique variations that 
may be associated with other deleterious changes, including the 
introduction of early stop codons or amino acid substitutions 
predictive of deleterious effects. 19 In this work, we considered 
two different, plausible, models of combining disease-associated 
variants, and they give similar results, but many other models 
are possible. Our work further highlights the need for more 
comprehensive, multivariate models of disease risk, including 
possible epistatic interactions, 20 and these more complex models 
will be developed as the research community continues to 
investigate the contributions of genetic variations in disease. 

The clinical use of genomic data is in its infancy. However,
our results suggest that even with the currently limited knowledge
of gene-disease associations, genome sequencing provides
Table 2 Relative difference between clinical profiles from genotyping
array single nucleotide polymorphisms (SNPs) and sequencing by
disease, using the product of likelihood ratio model

Disease                                 Affymetrix  Affymetrix  Illumina    Illumina
                                        500K        500K        Omni        Omni
                                        Measured    Imputed     Measured    Imputed
Abdominal aortic aneurysm               0.83        0.00        0.26        0.00
Acute lymphoblastic leukemia
0.00  1.00  0.00  1.00  1.00
Alzheimer's disease                     0.81        0.61        0.48        0.43
Amyotrophic lateral sclerosis
Celiac disease                          -           -           0.00        -
Cerebral malaria                        1.00        0.71        -           0.71
Chronic lymphocytic leukemia            0.62        0.54        0.42        0.42
Chronic obstructive pulmonary disease   0.92        0.72        0.72        0.72
Colorectal cancer                       1.00        0.79        0.64        0.32
Crohn's disease                         -           -           -           0.30
Depression                              0.82        0.66        0.19        0.04
Eczema                                  1.00        0.84        0.55        0.00
Epilepsy                                1.00        -           -           -
Esophageal cancer                       0.96        0.78        0.75        0.67
Gastric cancer                          0.96        0.81        0.42        0.05
Glioblastoma                            1.00        0.00        0.00        0.00
Graves' disease                         0.75        0.44        0.30        0.23
Hodgkin lymphoma                        1.00        -           -           -
Hypertension                            -           0.47        0.49        -
Hypertrophic cardiomyopathy             -           0.38        0.00        0.00
Migraine                                1.00        0.91        0.28        0.28
Multiple myeloma                        1.02        0.36        0.98        0.36
Multiple sclerosis                      -           -           -           -
Obesity                                 0.92        0.75        0.78        0.49
Osteoarthritis                          0.92        0.82        0.84        0.78
Parkinson's disease                     0.89        0.37        0.46        0.24
Psoriasis                               0.92        0.29        0.42        0.18
Rheumatoid arthritis                    0.80        0.37        0.33        0.12
Schizophrenia                           0.99        0.32        0.59        0.23
Stroke                                  0.98        0.53        0.29        0.25
Systemic lupus erythematosus            0.99        0.86        0.61        0.47
Thyroid cancer                          0.95        0.25        0.97        0.16
Tourette syndrome                       1.00        1.00        0.00        0.00
Type 1 diabetes                         0.91        0.70        0.34        0.22
Type 2 diabetes                         -           0.48        0.69        0.47
Ulcerative colitis                      -           0.30        0.34        0.00
Venous thrombosis                       1.00        0.00        0.59        0.00



The ratio of the root mean squared (RMS) difference between the log of the product of 
likelihood ratios derived from genotyping array SNPs and that derived from sequence data, 
over the RMS likelihood ratios derived from sequence data alone for each disease. 
Dark gray corresponds to a greater relative difference; lighter gray corresponds to a lower 
relative difference; and white (0.00) indicates no difference between the results of 
sequencing and genotyping. 

a substantially different, medically relevant risk profile than that 
available from common genotyping arrays. Although custom 
genotyping arrays can certainly be designed around known 
disease associated variants, it is likely that the continual discovery 
of new associations between variations and disease will outpace 
their design cycle. Indeed, an initial hypothesis of this work was 
that the heavy use of genotyping arrays in genome-wide association
studies would skew the results toward highlighting only
genetic variants already measured on genotyping arrays and thus 
very little difference would be apparent. However, variants 
outside of the scope of genotyping array technology will continue 



Table 3 Relative difference between clinical profiles from genotyping 
array single nucleotide polymorphisms (SNPs) and sequencing by 
disease, using the maximum likelihood ratio model 



Disease                                 Affymetrix  Affymetrix  Illumina    Illumina
                                        500K        500K        Omni        Omni
                                        Measured    Imputed     Measured    Imputed


Abdominal aortic aneurysm               0.37        0.00        0.15        0.00
Acute lymphoblastic leukemia            -           -           -           -
Acute myeloid leukemia                  0.00        0.00        -           0.00
Adenocarcinoma                          1.11        -           1.00        1.00
Adolescent idiopathic scoliosis         -           0.83        0.96        0.26
Age related macular degeneration        0.96        0.72        0.01        0.01
Alcohol dependence                      -           -           0.05        -
Allergic rhinitis                       1.00        -           1.00        -
Alzheimer's disease                     0.53        0.41        0.34        0.34
Amyotrophic lateral sclerosis           -           -           1.48        0.00
Ankylosing spondylitis                  -           -           -           -
Anorexia                                1.00        0.00        0.00        0.00
Asthma                                  0.99        0.22        0.42        0.01
Atrial fibrillation                     0.00        -           1.00        0.00
Autoimmune thyroid disease              0.00        0.00        0.00        0.00
Biliary tract cancer                    1.00        1.00        1.00        1.00
Bipolar disorder                        0.80        0.73        0.40        0.39
Bladder cancer                          0.84        0.39        0.44        0.18
Bronchopulmonary dysplasia              -           0.00        0.26        0.00
Cardiac hypertrophy                     -           1.00        1.00        1.00
Cardiovascular disease                  1.00        -           0.09        0.09
Cataract                                1.00        1.00        0.00        0.00
Celiac disease                          -           -           -           -
Cerebral malaria                        -           0.90        1.00        0.90
Chronic lymphocytic leukemia            0.59        0.25        0.23        0.23
Chronic obstructive pulmonary disease   0.92        0.81        0.81        0.81
Colorectal cancer                       0.87        0.21        0.47        0.00
Crohn's disease                         -           -           -           -
Depression                              0.36        0.05        0.09        0.04
Eczema                                  1.00        0.84        0.55        0.00
Epilepsy                                1.00        0.00        1.00        0.00
Esophageal cancer                       0.94        0.66        0.75        0.65
Gastric cancer                          1.01        -           0.37        0.00
Glioblastoma                            1.00        0.00        0.00        0.00
Graves' disease                         -           0.18        0.12        0.12
Hodgkin lymphoma                        -           0.00        1.00        0.00
Hypertension                            0.75        0.27        0.05        0.04
Hypertrophic cardiomyopathy             -           -           -           0.00
Migraine                                1.00        1.27        0.62        0.62
Multiple myeloma                        0.90        -           0.76        0.14
Multiple sclerosis                      -           0.07        -           0.00
Obesity                                 0.68        0.17        0.32        0.13
Osteoarthritis                          1.13        0.69        0.73        0.66
Parkinson's disease                     1.02        0.28        0.23        0.00
Psoriasis                               0.61        0.22        0.43        0.25
Rheumatoid arthritis                    0.51        0.18        0.06        0.02
Schizophrenia                           0.80        0.20        0.39        0.20
Stroke                                  0.89        0.30        0.20        0.17
Systemic lupus erythematosus            0.89        0.39        0.28        0.21
Thyroid cancer                          1.25        0.27        0.90        0.20
Tourette syndrome                       1.00        1.00        0.00        0.00
Type 1 diabetes                         0.85        0.71        0.03        0.00
Type 2 diabetes                         0.58        0.45        0.43        0.39
Ulcerative colitis                      0.88        0.29        0.31        0.00
Venous thrombosis                       1.00        0.00        0.93        0.00



The ratio of the root mean squared (RMS) difference between the log of the maximum 
likelihood ratios derived from genotyping array SNPs and that derived from sequence data, 
over the RMS of the likelihood ratios derived from sequence data alone for each disease. 
Dark gray corresponds to a greater relative difference; lighter gray corresponds to a lower 
relative difference; and white (0.00) indicates no difference between sequencing and 
genotyping. 

to be discovered, as even in genome-wide association studies 
using arrays, targeted deep sequencing often identifies likely 
'causal' variants, often following the discovery phase. 
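
The relative difference defined in the footnote to Table 3 can be illustrated with a minimal sketch. It assumes one maximum likelihood ratio per individual from each platform for a single disease, and it assumes that the denominator is also taken on the log scale (the footnote does not say so explicitly). The function names and the example values are hypothetical and are not taken from the study.

```python
import math


def rms(values):
    """Root mean square of a non-empty sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def relative_difference(array_lrs, sequence_lrs):
    """RMS(log array LR - log sequence LR) divided by RMS(log sequence LR).

    Both inputs hold one positive likelihood ratio per individual, in the
    same order. Taking the log in the denominator is an assumption.
    """
    if not sequence_lrs or len(array_lrs) != len(sequence_lrs):
        raise ValueError("inputs must be non-empty and of equal length")
    log_array = [math.log(lr) for lr in array_lrs]
    log_sequence = [math.log(lr) for lr in sequence_lrs]
    differences = [a - s for a, s in zip(log_array, log_sequence)]
    numerator = rms(differences)
    denominator = rms(log_sequence)
    if denominator == 0.0:
        # Sequencing reports no risk signal at all for this disease.
        return 0.0 if numerator == 0.0 else math.inf
    return numerator / denominator


# Hypothetical per-individual likelihood ratios for one disease.
sequence_lrs = [2.0, 0.5, 1.5, 1.0]

# Array reproduces every sequencing-derived likelihood ratio.
identical = relative_difference(sequence_lrs, sequence_lrs)

# Array captures none of the known variants, so every LR is neutral (1.0).
uninformative = relative_difference([1.0, 1.0, 1.0, 1.0], sequence_lrs)

# Array captures the variants for only some individuals.
partial = relative_difference([2.0, 1.0, 1.5, 1.0], sequence_lrs)
```

Under these assumptions, a platform that reproduces the sequencing-derived values scores 0.00, and one that misses every relevant variant scores 1.00. Values above 1.00 arise when the array-derived log likelihood ratios depart from the sequencing-derived ones by more than the size of the sequencing-derived signal itself.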

Researchers investigating the association between genetics 
and specific diseases can use these results to compare the relative 
power of sequencing versus genotyping arrays to capture 
currently understood risk before they search for novel 
associations or replicate previously reported results. In addition, 
improved prior information can help inform difficult 
cost-benefit analyses, and our approach and the results reported here 
may help in the design of experiments. 

In this analysis, we very deliberately focused on an empirical 
evaluation of the differences in medically relevant genetic variant 
coverage between sequencing and commonly used genotyping 
arrays. Our analysis excludes many types of additional variation 
between the calls reported by arrays and next generation 
sequencing technologies, all of which would likely increase the 
differences we report. Just as genotyping platforms differ in 
which SNPs they measure, sequencing technologies can vary 
considerably not only in the underlying method of sequencing 
but also in the results they report, and these methods can differ 
one from another on DNA from the same individual. 21-23 At the 
same time, there are a variety of different bioinformatics 
techniques that can be used to interpret sequencing data and make 
actual base calls, and there can be substantial differences in the 
results reported by different methods, 24 another potential source 
of variation. In our work we also gave imputation methods 'the 
benefit of the doubt' by treating imputed genotypes as accurate, although 
the accuracy of these approaches is often far from ideal. 25 26 Our exclusion of 
these other sources of variation was intentional to allow us to 
limit the scope of our investigation, but we believe that future 
studies will show even larger differences. 

Genotyping arrays still have an important role in the discovery 
phase of genome-wide disease association studies, as the potential 
space for hypotheses examining associations is incredibly large 
using a full genome sequence. Targeted approaches that focus on 
key genes in disease associated pathways, or using known 
features that enrich for disease association such as expression 
variation, 27 28 may allow the construction of targeted genotyping 
arrays that aid in disease association discovery. In addition, future 
development and use of genotyping arrays specifically designed 
around known human disease associated SNPs may reduce the 
prognosis gap. 

In summary, many important, relatively common, currently 
known disease associated variants are not measurable or 
imputable from commercial genotyping arrays. These variants 
are common enough and have an effect size (likelihood ratio) 
large enough to influence our assumptions of risk for many 
diseases for many individuals. Individuals or their healthcare 
providers deciding between different technologies to elucidate 
potential disease risks encoded in the genome should take into 
account these differences of coverage as well as which disease 
conditions are most relevant to them. Also, researchers 
attempting to discover novel gene-disease associations or 
investigate previously reported associations should consider 
these differences in coverage across individuals as they balance 
cost versus benefit in their research planning. 